Language Annotations of the Early Web (1996-1999)
This dataset consists of metadata records identifying the language of
webpages from the early web (1996-1999) portion of the Internet
Archive’s web archive collection. The metadata includes specific capture
information related to a URL along with the top 3 language annotations
for each page, as detected by Google’s Compact Language Detector v3 (CLD3),
a neural network model for language identification. Confidence scores
are computed according to Internet Archive’s own Square-Leaf-Model,
based on a combination of the scores provided by CLD3, applied to each
HTML DOM-tree leaf node text, and the length of the analyzed text
snippets. The dataset contains 4,383,611 language-annotated records in
total. Additional details on format, statistics, and samples can be
found in the dataset's README file.
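
The description above says that confidence scores combine per-leaf CLD3 scores with the length of each analyzed text snippet, but it does not give the formula. The following is only an illustrative sketch of one way such a combination could work, a plain length-weighted average; it is not the Square-Leaf-Model itself. The function name `top_languages` and the input layout (one `(text, {language: score})` pair per DOM leaf) are assumptions made for this example, and the per-leaf scores are passed in rather than computed, so no CLD3 call is shown.

```python
from collections import defaultdict


def top_languages(leaves, k=3):
    """Return the k highest-scoring languages for one page.

    `leaves` is a hypothetical input format: a list of
    (text, {language_code: score}) pairs, one per DOM leaf node.
    ASSUMPTION: scores are combined as a length-weighted average.
    The actual Square-Leaf-Model formula is not stated in this description.
    """
    totals = defaultdict(float)
    total_len = 0
    for text, scores in leaves:
        weight = len(text.strip())  # longer snippets count for more
        if weight == 0:
            continue  # skip leaves that contain only whitespace
        total_len += weight
        for lang, score in scores.items():
            totals[lang] += weight * score
    if total_len == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    # Normalize by total text length so the results stay in [0, 1].
    return [(lang, total / total_len) for lang, total in ranked[:k]]


if __name__ == "__main__":
    page = [
        ("Hello world", {"en": 0.9}),
        ("Hallo", {"de": 0.8, "nl": 0.2}),
    ]
    for lang, confidence in top_languages(page):
        print(lang, round(confidence, 4))
```

Keeping the three best results mirrors the "top 3 language annotations" stored per record. The authoritative definition of the confidence score is the one in the dataset's README file.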
- 2021-01-08 14:18:50
- Internet Archive Python library 1.8.1
Uploaded by Helge Holzmann on